An Amharic Stemmer: Reducing Words to their Citation Forms
Abstract
Stemming is an important analysis step in a number of areas such as natural language processing (NLP), information retrieval (IR), machine translation (MT) and text classification. In this paper we present the development of a stemmer for Amharic that reduces words to their citation forms. Amharic is a Semitic language with rich and complex morphology. Such a stemmer is intended for dictionary-based cross-language IR, where the translation step requires looking up terms in a machine-readable dictionary (MRD). We apply a rule-based approach supplemented by occurrence statistics of words in an MRD and in a 3.1M-word news corpus. The main purpose of the statistical supplements is to resolve ambiguity between alternative segmentations. The stemmer is evaluated on Amharic text from two domains: news articles and a classic fiction text. It achieves an accuracy of 60% on the old-fashioned fiction text and 75% on the news articles.
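The idea of rule-based segmentation with statistical disambiguation can be illustrated with a minimal sketch. This is not the paper's implementation: the affix lists, the function names `candidate_segmentations` and `stem`, and the ranking order (dictionary membership first, then corpus frequency) are all assumptions made for illustration, and the Latin-script affixes are toy placeholders rather than real Amharic morphology.

```python
from collections import Counter

# Hypothetical toy affix lists (assumptions, not the paper's Amharic rules).
PREFIXES = ["", "ye", "be"]
SUFFIXES = ["", "och", "u"]


def candidate_segmentations(word):
    """Yield every (prefix, stem, suffix) split allowed by the toy affix lists."""
    for prefix in PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in SUFFIXES:
            if suffix and not rest.endswith(suffix):
                continue
            stem_part = rest[:len(rest) - len(suffix)] if suffix else rest
            if stem_part:  # skip splits that leave an empty stem
                yield (prefix, stem_part, suffix)


def stem(word, dictionary, corpus_counts):
    """Pick one stem among the alternative segmentations of `word`.

    Assumed ranking: a stem found in the dictionary beats one that is not;
    among equals, the stem with the higher corpus count wins.
    """
    candidates = list(candidate_segmentations(word))
    if not candidates:
        return word
    best = max(
        candidates,
        key=lambda c: (c[1] in dictionary, corpus_counts[c[1]]),
    )
    return best[1]


# Toy data: "beru" can be split as ber+u or be+ru; the counts decide.
toy_dictionary = {"bet", "ber", "ru"}
toy_counts = Counter({"bet": 10, "ber": 5, "ru": 1})
print(stem("yebetoch", toy_dictionary, toy_counts))
print(stem("beru", toy_dictionary, toy_counts))
```

In this sketch the dictionary check reflects the stated goal of producing forms that can be looked up in an MRD, while the corpus counts only break ties between segmentations that are all dictionary entries.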
Related resources
Stemming Hausa text: using affix-stripping rules and reference look-up
Stemming is the process of reducing a derived or inflected word to its root or stem by stripping all its affixes. It has been used as a preprocessing step in applications such as information retrieval, machine translation, and text summarization to increase efficiency. Currently, there are a few stemming algorithms which have been developed for languages such as English, Arabic, Turki...
Stemmer for Serbian language
In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form. This work presents a suffix-stripping stemmer for Serbian, a highly inflectional language.
TelStem: An Unsupervised Telugu Stemmer with Heuristic Improvements and Normalized Signatures
Stemming is a technique for reducing variant forms of a word to their roots (or stems) by extracting common suffixes. A stem need not correspond to the linguistic root of a word. Stemming is predominantly used in IR systems to improve retrieval effectiveness and to reduce the size of the index. This paper presents a systematic way of algorithm and implementati...
Citation Behaviours of Applied Linguists in Discussion Sections of Research Articles
It is now generally accepted that academic writing is a social activity by which authors negotiate with their audience to gain community acceptance for their findings. One way to achieve such acceptance is to establish intertextual links to prior research through citation. Despite extensive research on the topic and the suggestion of typologies for the form and function of citation in ...
Dictionary-based Amharic - English Information Retrieval
We present two approaches to the Amharic – English bilingual track in CLEF 2004. Both experiments use a dictionary based approach to translate the Amharic queries into English Bags-of-words, but while one approach removes non-content bearing words from the Amharic queries based on their IDF value, the other uses a list of English stop words to perform the same task. The resulting translated (En...